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Abstract —This paper introduces the necessary and sufficient 
conditions that surrogate functions must satisfy to properly 
define frontiers of non-dominated solutions in mniti-objective 
optimization problems. These new conditions work directly on the 
objective space, thus being agnostic about how the solutions are 
evainated. Therefore, real objectives or user-designed objectives’ 
surrogates are allowed, opening the possibility of linking indepen¬ 
dent objective snrrogates. To illustrate the practical consequences 
of adopting the proposed conditions, we nse Gaussian processes 
as surrogates endowed with monotonicity soft constraints and 
with an adjnstable degree of flexibility, and compare them to 
regular Gaussian processes and to a frontier surrogate method 
in the literature that is the closest to the method proposed 
in this paper. Results show that the necessary and sufficient 
conditions proposed here are finely managed by the constrained 
Gaussian process, guiding to high-qnaiity snrrogates capable of 
suitably synthesizing an approximation to the Pareto frontier 
in challenging instances of multi-objective optimization, while 
an existing approach that does not take the theory proposed 
in consideration defines snrrogates which greatly violate the 
conditions to describe a valid frontier. 

Index Terms —Ganssian processes; Necessary and sufficient 
conditions; Non-dominated frontier; Snrrogate functions. 

I. Introduction 

ULTI-OBJECTIVE optimization (MOO), also called 
multiple criteria optimization jT], is an extension of the 
standard single-objective optimization, where the objectives 
may be conflicting with each other ||2l, El. When a conflict 
exists, we are no more looking for a single optimal solution 
but for a set of solutions, each one providing a trade-off on the 
objectives and none being better than the others. This solution 
set is called the Pareto set and its counterpart in the objective 
space is denoted the Pareto frontier. 

The Pareto frontier is at the core of MOO algorithms, being 
the foundation of many methods devoted to evaluating the 
performance and comparing the solutions to each other ||4l. 
However, the frontier is defined by the objectives, which can 
be expensive to compute El, El, El. This leads to a variety 
of surrogate methods that try to approximate the objectives, 
e.g. H, El, thus saving computational resources. 

Among the surrogates that directly or indirectly estimate the 
Pareto frontier, one introduced by Yun et al. m is the closest 
to the surrogate described in this paper. They used a one-class 
SVM to define a function over the objective space whose null 
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space describes an approximation of the Pareto frontier. This 
function is used to select individuals, since its value increases 
as its argument becomes more distant from the frontier, which 
are then used for crossover in a genetic algorithm. 

Loshchilov et al. ED presented a similar SVM approach, 
but the function learnt is defined over the decision space, 
which allows direct comparison with the Pareto frontier ap¬ 
proximation without requiring evaluation of the objectives. 
This direct comparison can also be achieved with estimates 
built over the objective space by integrating surrogates for 
the objectives. However, contrary to the one-class SVM that 
learns a model to fit all samples on one side of the approximate 
frontier, the proposed SVM is also able to consider points that 
dominate the frontier being approximated, allowing approxi¬ 
mation of multiple Pareto frontiers, each defined by a class of 
points in non-dominated sorting ifT^ . 

In a different approach, Loshchilov et al. E3ll approximated 
the Pareto dominance instead of the Pareto frontier by using 
a rank-based SVM. In this case, instead of providing only the 
data points, the algorithm is also informed about the preference 
for an arbitrary number of sample parrs and tries to find a 
function where higher evaluation represents higher preference. 
Using the Pareto dominance to establish the preference be¬ 
tween points and learning directly from the decision space, 
candidate solutions can be compared in dominance using the 
learnt function. However, both ED and E3 try to estimate the 
Pareto frontier using generic function approximation models, 
which do not take into account the particularities of the Pareto 
frontier. 

It is possible to guarantee that the Pareto frontier’s estimate 
is valid by building conservative estimates. Eor instance, using 
a binary random field over the objective space to model 
the boundary between dominated and non-dominated regions. 
Da Fonseca and Fonseca El described a theory that can 
be used to assess the statistical performance of a stochastic 
optimization algorithm and compare different algorithms. The 
attainment function described in the paper defines the prob¬ 
ability that a run of the stochastic algorithm will dominate 
the function’s arguments. Although the attainment function is 
hard to compute, it can be approximated by multiple runs of 
the underlying algorithm, which makes it a good candidate for 
analyzing the performance statistics of the optimization algo¬ 
rithm and for performing hypothesis testing between MOO 
algorithms. 

If a single run is considered, then the approximate attain¬ 
ment function describes a valid estimate of the Pareto frontier 
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and it is defined as the border of the region dominated by the 
points provided. Although valid, this estimate is very conser¬ 
vative and does not interpolate between the points provided, 
which means it cannot provide a good idea of the frontier’s 
shape and any evaluation of new points could be performed 
using only dominance comparison with the provided points. 

In this paper, we develop a theory that defines necessary and 
sufficient conditions for a functional description of a Pareto 
frontier. Based on this theory, the search for approximations 
for the Pareto frontier using surrogate functions should be 
constrained to, or at least focused on, the ones that satisfy 
the results. If not, the resulting manifold obtained from the 
function may have any shape, possibly with many dominated 
points, which could result in reduced performance. 

Moreover, the theory is developed on the objective space, 
allowing either accurate or approximate objective evaluations 
to be used, without restricting the format of the objectives’ 
surrogates. If parametric surrogate objectives are used, their 
association with the Pareto frontier surrogate can provide 
feedback on how to adjust their parameters so that the ap¬ 
proximation is closer to the real objectives. 

As an example of how to integrate the theoretical conditions 
in a surrogate design, we show how to introduce the theoretical 
conditions as soft constraints in Gaussian processes m, 
which are nonparametric models, thus being able to adjust 
to variable number of samples, and whose hyper-parameters 
can be easily optimized. 

To validate the hypothesis that surrogate methods that do 
not consider this theory may define invalid Pareto frontier 
approximations, the constrained Gaussian process is compared 
to a regular Gaussian process and to an existing SVM-based 
surrogate ifTOll and results show that the soft constrained Gaus¬ 
sian process finds good approximations maximally obeying the 
constraints according to the degree of flexibility of the model. 
On the other hand, the models that do not take into account 
the theory can violate greatly and arbitrarily the conditions for 
a valid Pareto frontier. 

This paper is organized as follows. Section|II]introduces the 
notation and principles of multi-objective optimization used 
in this paper. Section Hill shows the conditions that a function 
must satisfy to define a Pareto frontier. These conditions are 
then used in Section |IV] to build a function to approximate 
a frontier given some points on it and the approximation 
is compared to an existing surrogate. Finally, Section [V] 
summarizes the findings and points out future directions for 
research. 

II. Multi-Objective Optimization 

A multi-objective optimization (MOO) problem is defined 
by a decision space X and a set of objective functions 
gi{x ): A” —> i e {1,.. . ,M}, where 3^^ C R fTh). Since 

the framework is the same for maximization or minimization, 
we will consider that minimization is desired in all objectives. 
For a given point x in the decision space, the point defined 
by its evaluation using the objectives y = {gi{x),..., Pm (a;)) 
is its counterpart in the objective space y = yi x • • • x 3^m- 

Although the objective space usually only makes sense 
when coupled with the decision space and objectives, which 


allows for its infeasible region and Pareto frontier to be 
defined, we will work only with the objective space in this 
paper, which means that the results hold for any problem. We 
will also consider that y = , since any restriction for a 

specific problem is defined by means of the objectives and 
decision space constraints, and are handled transparently. 

Furthermore, we assume that the optimal solutions describe 
a set of M—1 manifolds on K^, which correspond to curves in 
the 2D case and surfaces in the 3D case. Most multi-objective 
optimization problems have solutions with this property, with 
noticeable exceptions, such as: i) problems where some of the 
objectives do not conflict, so that only one of them should be 
used in the MOO problem with the other conflicting objectives, 
while the optimality of the ignored objectives is guaranteed 
because they were redundant; and ii) some problems with less 
decision variables D than objectives M, such as the Viennet 
function El. 

Since we are dealing with an optimization problem, we must 
define operators to compare solutions, like the operators < and 
< are used in the mono-objective case. In MOO, this operator 
is the dominance. 

Definition 1 (Dominance). Let y and y' be points in 
the objective space. Then y dominates y' , denoted y A y' , if 
Vi < ui for all i. 

The definition of dominance used in this paper is the same 
provided in IH, which allows a point to dominate itself. This 
relation is usually called weak dominance, but we call it 
“dominance” for simplicity, since it is the main dominance 
relation used in this paper. Another common definition is to 
require that Vi < y[ for at least one i, and both definitions are 
consistent with the theory developed in this paper. 

Definition 2 (Strong Dominance). Let y and y' be points 
in , the objective space. Then y strongly dominates y', 
denoted y -< y', if yi < y' for all i. 

Once defined the comparison operator, we can divide the 
space y in three sets: an estimated Pareto frontier, the set of 
points strongly dominated by the estimated frontier, and the 
set of points not strongly dominated by the estimated frontier. 

Definition 3 (Estimated Pareto Frontier). A path-connected set 
of points F C is said to be an estimated Pareto frontier if 
no point in it strongly dominates another point also in F, that 
is, Vy € F, $y' G F: y' y, and every point in the objective 
space except for F either strongly dominates or is strongly 
dominated by a point in F, that is, Vy G — F, 3y' G 
F-. y <y' y y' <y. 

A set S is path-connected if there is a path joining any 
two points X and y in S' and a path is defined by a continuous 
function p: [0,1] S withp(O) = x andp(l) = y. Therefore, 
if there is a continuous path of points in S that gets from any 
X G S to y G S, then S is path-connected. Based on this 
definition, an estimated Pareto frontier F divides the objective 
space in three disjoint sets: points strongly dominated by 
points in F, points that strongly dominates points in F, and 
F itself. 
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Figure 1; Example of the definitions for a particular multi¬ 
objective problem. The estimated strict Pareto frontier Fg is 
shown in a solid blue line, the estimated Pareto frontier F 
includes the solid and dashed blue lines, the dominated region 
D is shown on the top right red area, and the non-dominated 
region D is shown on the bottom left green area. 

Definition 4 (Estimated Strict Pareto Erontier). A set of points 
Fg C is said to be an estimated strict Pareto frontier if no 
point in it dominates another point also in Fg, that is, \/y £ 
Fg, $y' € Fs,y' ^ y: y' ^ y, and every point in the objective 
space except for Fg either dominates or is dominated by a point 
in Fg, that is, Vy £ — Fg, 3y' & Fg-. y ^ y' \J y' < y. 

Definition 5 (True Pareto Erontier). An estimated strict Pareto 
frontier F* is a true Pareto frontier if and only if, for all points 
in F*, there is no other feasible point in the objective space 
that dominates the point in the frontier, that is, Vy £ F* ,$x G 
X,g{x) 7 ^ y: g{x) A y. Moreover, for a given problem, the 
true Pareto frontier is unique. 

The estimated Pareto frontier of Definition [3 is a gener¬ 
alization and an approximation of the true Pareto frontier 
in two ways: i) if the true Pareto frontier is discontinuous, 
then dominated points are added so that the estimated Pareto 
frontier F is path-connected while also guaranteeing that no 
point in it strongly dominates any other; and ii) the estimated 
Pareto frontier is simply a set of points that divide the space 
into dominated and non-dominated regions, without stating 
anything about the optimality of its points. 

Consider, for instance, a problem where one of the objec¬ 
tives is given by 

a; -f 1, a; > 1 
X, otherwise, 

and the other is given by g 2 {x) = —x. Then the true Pareto 
frontier F* is given by 

F* ={(x + 1, —x) I a; £ R, a; > 1} 

U {(a;, —x) I a; £ K, a; < 1}, 

which clearly is not path-connected. However, if we add the set 
of points F = {(y, — 1) | y £ (1,2]} to F*, then the resulting 


path-connected set E" = F* iJF satishes Dehnition [2 despite 
the fact that every point in F is dominated by (1,1) £ F*, 
but not strongly dominated by it. 

Pigure[T] shows an estimated strict Pareto frontier Fg, which 
coincides with the true Pareto frontier F* in this example, 
and the path-connected estimated Pareto frontier F for this 
problem. This makes it clear that the estimated Pareto frontier 
F can contain the true Pareto frontier F*, i.e. F* C F, while 
providing a path-connected ID manifold that splits the whole 
objective space R^. Of course, these properties of the estimated 
Pareto frontier are extensible to M > 2 objectives. 

With the dehnition of an estimated Pareto frontier, the 
objective space is divided into two sets, named dominated and 
non-dominated sets, also shown in Pig. [T] 

Definition 6 (Dominated Set). The dominated set D for an 
estimated Pareto frontier F is the set of all points in R^ 
where, for each one of them, there is at least one point in F 
that strongly dominates it, that is, D = {y £ R^ | 3y' £ 
F: y' ^ yj. 

Definition 7 (Non-Dominated Set). The non-dominated set D 
for an estimated Pareto frontier F is the set of all points that 
are not in F or D. This implies that D = {y £ R^ | By' £ 
F-.y< y'}. 

Note that, from the definition of strong dominance, both D 
and D are open and unbounded sets, with boundaries defined 
by the estimated Pareto frontier F. Purthermore, if F contains 
the true Pareto frontier, then the points in D are not achievable 
due to the objectives’ definitions. 

Prom the partition of the objective space in three sets, 
one estimated Pareto frontier, one dominated and one non- 
dominated set, we can define a score function similarly to ini, 

na. 

Definition 8 (Score Punction). A score function /(y): R^^ ^ 
R for a given estimated Pareto frontier F is a function that 
satisfies 

fiy) = 0, Vy £ F, 

/(y) >0, Vy £ D, 

/{y)<0, Vy£D. 

Therefore, a score function provides a single value that 
places its argument in relation to the estimated frontier. 
Moreover, for a given estimated Pareto frontier F, there are 
many possible choices of score functions /(y) that satisfy the 
definition and all of them uniquely define F based on their 
solution set /(y) = 0. This allows a score function to work 
as a surrogate for the estimated Pareto frontier. 

111. Necessary and Sueficient Conditions for 
Surrogate Score Punctions 

In this section, we will show how a score function /(y) 
can induce an estimated Pareto frontier F and the conditions 
it must satisfy so that the set it defines is indeed an estimated 
Pareto frontier, that is, no point in it strongly dominates any 
other point in it. 
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The main theory developed is based on the most general 
notion of a function /, but the conditions may be hard to 
evaluate for a general case. Therefore, we will also provide 
corollaries that prove the results for functions with additional 
constraints, like continuous derivatives. Since some of these 
results depend on Taylor approximations and the first deriva¬ 
tive at the required points may be zero, we must define a 
generalized gradient. 

Definition 9 (Generalized Gradient). Let h € where 
is the class of functions where the first k derivatives exist and 
are continuous, with fc > 1. Let k*{h) be the first non-zero 
derivative of h evaluated at 0, that is. 


k*(h) = arg min 

l<z<fe 


d*/i 

dx* 



where k*{h) is not defined if ft, is a constant function or no i 
satisfies the inequality. Then 


' 0 , 

A(ft) = <1 1 d'^'^^^ft 


k*{h)\ 


3a S M, Vx: ft(x) = a 

otherwise 


x—0 


is the generalized gradient operator, which is undefined if there 
is no i that satisfies the inequality. 


The role of the generalized gradient in the theory to be 
presented is to avoid issues with functions that may have 
null derivative at the points being evaluated but that are also 
increasing. Consider, for instance, the function /(x) = x^, 
whose gradient is null at x = 0. This function is strictly 
increasing, but the first-order approximation using Taylor 
series is a constant. In order to consider small changes in 
the function’s argument, we must use first non-null derivative, 
which is the generalized gradient, as it will dominate the 
approximation. 

The generalized gradient can be used in the Taylor approx¬ 
imation as h{S) = ft(0) +6'A{h) + 0{S"), where 0 < 5 ^ 1, 
S' = 5" = and O(-) is the big-O notation. 

Since the result is based on 5 being a small value, the 
exact power used to compute S' is not important for the 
approximation and the term 0{S") is dominated by the other 
factors. 

The extensions to continuous functions / rely on the gen¬ 
eralized gradient of a single-parameter continuous function /, 
derived from the original /, having different signs for opposite 
directions. However, it does not hold for functions where fc*( ) 
is even. 

For example, consider ft(x) = x^, which has fc*(ft) = 2. 
The Taylor approximation is given by h{S) Ki (5^A(ft(x)) = 
2S^ = (5^A(ft(—x)) R4 h{—S), which does not give different 
signs to different directions of x. Therefore, the two constraints 
on A(/) defined in the corollaries that follow can be viewed 
as a single constraint on A(/) plus the constraint that k*{f) 
is odd. 


A. Necessary Conditions 

The necessary conditions derived are direct applications of 
the estimated Pareto frontier’s definition and establish the basic 


ground on how to define a function / from a given estimated 
frontier. 

Lemma 1 (General Necessity). Let F be an estimated Pareto 
frontier. Let f{y ): —5> R fte a score function for F. Then 

f{y + Su) > 0 and f{y — Su) < 0 for all y € F, u £ (0,1]^, 
and 5 S R, (5 > 0. 

Proof Assume there are y, u, and S > 0 such that f{y+Su) < 
0. Let y' = y + Su, so that y -< y'. 

If f{y') < 0, then from the definition of a score function 
there is some y* G F such that y' ^ y*. From the transitivity 
of dominance, we have that y < y' F y\ which is a 
contradiction, since the point y* in the frontier cannot strongly 
dominate the point y also in the frontier. Then we must 
have f{y') = 0, which means y' £ F and also creates a 
contradiction. 

Assume that f{y — Su) > 0, and let y” = y — Su. Then we 
can similarly prove that it also creates a contradiction. 

Therefore, there are no such y, u, and S with f(y + Su) < 0 
orfiy-Su)>0. ■ 

This result is intuitive, since moving S in direction u from 
y we enter either D or D. If the function has the required 
derivatives, then the following result holds. 

Corollary 1 (Differentiable Necessity). Let F be an estimated 
Pareto frontier. Let f{y ): R'^^ —> R fte a score function for 
P- /+„(x) = f{y + xu) and^ = fiy - xu), with 
X £ [0, oo). Let A(/+„) and A(/~„) be defined for all y £ F 
and u £ (0,1]^. Then A(/+^) > 0 and A(/~^) < 0 for all 
y £ F and u £ (0,1]'^. 

Proof. Since / satisfies all conditions from Lemma[Tl we have 
that f{y + Su) > 0 and f{y — Su) < 0 for all y, u, and 5 > 0. 

In particular, let (5^1. Approximating using Taylor series, 
we have that f{y + Su) /(y) -f S'A{f+J > 0 and f{y - 
Su) ~ f{y) + S'A{ff.f) < 0, where S' is the appropriate 
power of S for the expansion. Since f{y) = 0 and S' > 0, 
then A(/+„) > 0 and A(/“„) < 0 must hold. ■ 

Although this corollary may appear to provide weaker guar¬ 
antees on /, its proof shows that the inequality constraints on 
the generalized gradient is equivalent to the direct inequalities 
on the function defined in the previous lemma. 

B. Sufficient Conditions 

Once defined how the estimated Pareto frontier relates to a 
given score function, we will show that a function that satisfies 
the results of the previous lemma and corollary in fact uniquely 
defines an estimated Pareto frontier F. 

Lemma 2 (General Sufficiency). Let f{y): R'^ —>■ R fte a 
function. Let F = {y £ R'^ | f{y) = 0} be a path-connected 
set. Let f{y Su) > 0 and f{y — Su) < 0 for all y £ F, 
u £ (0,1]'^, and (5 S R, <5 > 0. Then F is an estimated Pareto 
frontier. 

Proof. For F to be an estimated Pareto frontier, we have to 
prove that for any y,y' £ F,y y' we have y -f y'. Assume 
there are y and y' in F such that y < y'. 






5 


Let u = y' — y and 5=1. Then we have f{y + Su) = 
f{y') = 0, which violates the first inequality on /(•). Alter¬ 
natively, we have f{y' — Su) = f{y) = 0, which violates the 
second inequality. 

Therefore, there are no y and y' in F such that V -< y'^ and 
F is an estimated Pareto frontier. ■ 

The restrictions on /(y ± Su) may be hard to verify in 
general, since they must be valid for all S. However, if the 
function has the appropriate derivatives, then it becomes easier 
to check if it satishes the requirements. 

Corollary 2 (Differentiable Sufficiency). Let f{y): —>■ K 

be a function. Let F = {y G | f{y) = 0} be a path- 
connected set. Let /+„(a;) = /(y + xw) and f~^{x) = f{y- 
xu), with X G [ 0 , 00 ). Let A(/+„) > 0 and A(/~„) < 0 for 
all y G F and u G (0,1]*^. Then F is an estimated Pareto 
frontier. 

Proof. To use Lemma|2] we must prove that f{y-\-Su) > 0 and 
/(y — Su) < 0 for all y G F, m G (0,1]^, and 5 G K, 5 > 0. 

Suppose there is some y, u, and S in the domain such that 
f{y + Su) = 0. Moreover, let S be the smallest value for 
which this happens for a given y and u. Let 0 < e <C 1 
and e < S. Then f{y + eu) « /(y) -h e'A{f+J > 0 and 
f{{y + Su) — eu) ~ /(y + ^w)+ e'A(/~„) < 0, where e' is the 
appropriate power of e for the approximation. However, /(■) 
cannot go from positive to negative without passing through 
0 due to its continuity. Then there must be some S' < S such 
that /(y + S'u) = 0, which contradicts the definition of S. 

Therefore, the first inequality on Lemma holds. We can 
use a similar method to prove the second inequality, and then 
use the lemma. ■ 

Again, this corollary shows the equivalence between the 
inequalities on the function and on the generalized gradient. 

C. Necessary and Sufficient Conditions 

Since the symmetry between Lemmas [T] and is clear, 
we can build a theorem to merge those two and provide 
necessary and sufficient conditions for defining an estimated 
Pareto frontier F from a score function /(y). 

Theorem 1 (General Score Function). Let f{y) : —>• R 

be a function. Let F = {y G R^ | /(y) = 0} be a path- 
connected set. Let D = {y G R*^ | 3y' G F: y' -< y} and 
D = R_^\(F U D). Let f{y) > 0,Vy G D, and f{y) < 
0,Vy G D. Then F is an estimated Pareto frontier if and only 
if f{y-\-Su) > 0 and f{y — Su) < 0 for all y G F, u G (0, 1]^, 
and 5 G R, 5 > 0. 

Proof. Assume that the constraints on / are valid. Then, from 
Lemma we have that F is an estimated Pareto frontier. 
Now assume that F is an estimated Pareto frontier. Then, from 
Lemma [T] we have that the constraints on / are valid. ■ 

Instead of requiring knowledge of the sign of /(y) over the 
sets, we can use a more strict dehnition, requiring continuity, 
to guarantee that the result holds. 


Corollary 3 (Continuous Score Function). Let f{y): R^ —>• 
R he a continuous function where there are points and 
V- such that /(u+) > 0, f(y-) < 0, and V- -< v^. Let 
F = {y G R'^ I /(y) = 0} he a path-connected set. Then F 
is an estimated Pareto frontier if and only if f{y-\-Su) > 0 and 
f{y — Su) < 0 for all y G F, u G (0,1]*^, and 5 G R, 5 > 0. 

Proof. Assume that F is an estimated Pareto frontier. Assume 
that there are y, y' G F = {y G R'^ | 3y' G F: y' ^ y} such 
that /(y) > 0 and f{y') < 0. From the continuity of /, we 
have that there is some z G D such that f{z) = 0. However, 
since f{z) = 0, it is in F. From the definition of D, there is 
some z' G F such that z' -< z, which violates the assumption 
that F is an estimated Pareto frontier. Therefore, all points in 
D have the same sign over /. The same can be shown for D. 

Since v- -< v+, we have that v+ G D and v- G D. Then 
/ satisfies all conditions from Theorem [T] ■ 

Again, we can replace the constraints on f{y ± Su) by the 
constraint on the generalized gradient. 

Corollary 4 (Differentiable Score Function). Let 
f{y): R*^ —^ R he a function where there are points 
z)_l_ and V- such that f(v.^.) > 0, /(w-) < 0, and V- -< u+. 
Let F = {y G R^ | /(y) = 0} be a path-connected set. Let 
fy,uix) = fiy + xu) and f-^{x) = f{y - xu). Let A(/+„) 
and A(/“„) be defined for all y G F and u G (0,1]^. Then 
F is an estimated Pareto frontier if and only if A(/+^) > 0 
and A(/“„) < 0 for all y G F and u G (0,1]^. 

Proof. We can use Corollary [3 to show that the restrictions on 
f{y±Su) must hold. From Corollaries [T] and |2l we know that 
the restrictions on A(/^„) are the same as the restrictions on 
/(y ± Su), so this corollary is valid. ■ 

IV. Learning Surrogate Functions erom Samples 

After showing what conditions the function / must satisfy, 
one could ask how to build such function for a given problem 
and specially how to learn one from a given set of non- 
dominated points. This can be a hard question to answer in 
general, but we can provide an additional lemma that can help 
in many cases. 

Lemma 3 (Strictly Increasing Sufficiency). Let f{y): R^ —>• 
R he a strictly increasing function on each coordinate. Let 
F = {y G R'^ I /(y) = 0}. Then F is an estimated Pareto 
frontier. 

Proof. For F to be an estimated Pareto frontier, we have to 
prove that for any y,y' G F,y y' we have y -f y'. Assume 
there are y and y' in F such that y A y'. 

Let F = (po = y,pi,...,PM-i,PM = y') be a path 
between y and y' that increments only one coordinate at a time. 
Since / is strictly increasing, we have that f(pi ) < fiP i+l)- 
Thus /(y) < f{y'), which contradicts the premise that 
fiy) = fiy') = 0 because they are both in the frontier. 

Therefore, there are no y and y' in F where y <y' and F 
is an estimated Pareto frontier. ■ 

Note that, because / is strictly increasing, there is no point 
in F that even dominates another point in F, which was 
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allowed in Definition |3] This restriction can be relaxed to 
be only monotonically non-decreasing if one can guarantee 
that f{y) = 0 is only a manifold, and not a subspace with 
volume. If f{y) — 0 is a subspace, then we can bnd two 
points in it where one dominates the other, which violates the 
basic definition of an estimated Pareto frontier. For instance, a 
function that is monotonically non-decreasing and is constant 
in at most one dimension at a time does not create a subspace 
on f{y) = 0. 

Nonetheless, this lemma can be used as a guide on how to 
build a function for the general case. We will build a model 
that tries to approximate an estimated Pareto frontier from 
a few of its samples using an approximated monotonically 
increasing function based on Gaussian processes. 

A. Gaussian Process As a Function Approximation Problem 

Since the model should have enough flexibility to ht the 
given samples, an appropriate choice for a surrogate function 
is a Gaussian process, which always has enough capacity to ht 
the data. Before describing how a Gaussian process is used to 
approximate the Pareto frontier, we provide the reader with an 
overview of how they work. For a more detailed description, 
we refer the reader to Ea. 

A Gaussian process (GP) is a generalization of the multi¬ 
variate normal distribution to inhnite dimensions and can be 
used to solve a regression problem. A GP dehnes a probability 
distribution over functions, such that the outputs are jointly 
normally distributed. 

To better understand this concept, consider an inhnite col¬ 
umn vector y £ and an inhnite matrix x £ Then 

a function /: R can be described by associating the 

row indexes, such that f{xi) = yi. The GP relies on the fact 
that the relationship between x and y can be written as: 

y M{pix),K{x)), (1) 

which states that all dimensions of y are distributed according 
to a multivariate normal distribution with mean pL{x) and 
covariance K{x). Moreover, the mean for a given dimension 
is given by E[j/i] = y,{xi) and the covariance is given by 
Cov{yi,yj) = k{xi,Xj), where fc(-, •) is a positive semi- 
dehnite kernel function. 

Although continuous functions, and thus Gaussian pro¬ 
cesses, are dehned for an inhnite number of points, which 
caused the vectors x and y to have inhnite dimensions, only 
a hnite number of observations are actually made in practice. 
Let N be such number of observations. Then, by the marginal¬ 
ization property of the multivariate normal distribution, we 
only have to consider N observed dimensions of x and y. 
Furthermore, the hnite-dimension version of y is still normally 
distributed according to Eq. O when considering only the 
observed dimensions. 

Usual choices for the mean and covariance functions are 
the null mean Ea, such that y,{x) = 0, and the squared 
exponential kernel, dehned by: 

k{x,x) = rfexp\--^^ - ^2 - 1 - ( 2 ) 


where rj, pi > 0 and pi are the scale parameters, which dehne 
a representative scale for the smoothness of the function. 

The choice of the kernel function establishes the shape 
and smoothness of the functions dehned by the GP, with the 
squared exponential kernel dehning inhnitely differentiable 
functions. Other choices of kernel are possible and provide 
different compromises regarding the shape of the function 
being approximated, such as faster changes and periodicity of 
values. However, in order to use the monotonicity constraints 
introduced in Section IIV-BI the kernel has to be at least twice 
differentiable, which limits the kernels that can be used. 

Figure |2a] shows the prior distribution over functions using 
the squared exponential kernel with p = 1, p = 0.5, D — 1, 
and the zero mean. This highlights the fact that the GP defines 
a distribution over functions, not a unique function. Three 
sample functions from this GP are also shown in the same 
figure. Note that the functions are not shown as continuous, 
which would require an inhnite number of points, but as hnite 
approximations. 

To use the GP to make predictions, the observed values of 
X are split into a training set X, whose output Y is known, 
and a test set AT*, whose output U, we want to predict. Since 
all observations are jointly normally distributed, we have that 
the posterior distribution is given by: 

yj,|X*,X,y-A/'(p*,E*) (3a) 

p* = K{X,,X)K{X, X)-^Y (3b) 

S, = K{X^,X^) - K{X^, X)K(X, X)-^K{X, X^), (3c) 

where are matrices built by computing the kernel 

function for each combination of the arguments values. 

The posterior distribution for the previous GP, after four 
observations marked as black dots, is shown in Fig. |2b] Note 
that the uncertainty around the observed points is reduced 
due to the observation themselves, and the mean function 
passes over the points, as expected. Again, three functions 
are sampled from the posterior, and all agree on the value the 
function must assume over the observations. 

In order to avoid some numerical issues and to consider 
noisy observations, we can assume that the covariance has 
a noisy term. Assuming that yi = f{xi) + Ci, where is 
normally distributed with zero mean and variance cr^, then 
the covariance of the observations is given by Cov{yi,yj) = 
k{xi, Xj) -I- a^5ij. The noiseless value li = f{xi) can then be 
estimated by: 

L^\X,,X,Y (4a) 

p, = K{X^,X)nY (4b) 

= K{X^,X^) - K{X^,X)nK{X,X^) (4c) 

n= [K{X,X) + a'^l]~\ (4d) 

which is similar to Eq. ©, except for the added term in H 
corresponding to the noise. 

B. Gaussian Processes with Monotonicity Soft Constraint as 
Surrogates 

Just like in the previous section, we consider the null mean 
function p{x) = 0 and the squared exponential kernel dehned 
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X X 

(a) Before observations (b) After observations 

Figure 2; Function distribution using a Gaussian process. Before the observations, the distribution is the same over all the 
space. After the observations, the distribution adapts to constraint the possible functions. The distribution mean is given by 
the black line and the 95% confidence interval is given by the shadowed region. Three function samples are also provided for 
each case. 


in Eq. Since we are mapping from the objective space 
to a value in M, according to Definition!^ the input values are 
the objectives y and the outputs the scores z. 

Let Y e be a set of N input points and Z S 

their desired targets for training. We define the latent variable 
L between the two, such that 

L\X r^MiO,K{Y,Y)), 

where K{Y,Y)ij = k{yi,yj). The latent variable then pro¬ 
duces the observed values Z through 

where I is the identity matrix. 

This model is the same as the one described in Section lTV-AI 
However, only the mean prediction will be used in this paper 
to describe the estimated Pareto frontier. Moreover, we will 
show how changing the allowed noise level a affects the Pareto 
frontier approximation. 

Besides the observations of f{y) at the desired points, 
the GP framework also accepts observations of its derivative, 
since differentiation is a linear operator m, ESI, that is, the 
derivative of a GP is also a Gaussian process. However, since 
we do not know the desired value of the gradient, only that 
it should be positive, from Corollary |4| and Lemma |3| forcing 
an arbitrary value may lead to reduced performance. 

Another option is to introduce a probability distribution 
over the gradient in order to favor positive values, introducing 
monotonicity information ll20l . This new distribution can be 
viewed as adding constraints to the Gaussian process, making 
it feasible to include the monotonicity information to the 
existing framework. 

Ideally, the probability distribution over the gradient is the 
step function, which provides a probability of zero if the 
gradient is negative and the same probability for all positive 


gradients. However, the step function defines a hard threshold 
and does not allow small errors, which can cause some 
problems for the optimization. Therefore, a smooth function 
that approximates the step is used to define a soft constraint 
over the gradient. 

(i) 

Let m)i be the indication that the *-th sample is monotonic 
in the direction di. Then the following probability distribution 
can be used to approximate the step function; 



dydi ) \dydiv) 





(5a) 

(5b) 


where we assume the probit function $(•) as the derivative 
probability. Since the probit is a cumulative distribution func¬ 
tion, its value ranges from 0 to 1 and it is monotonically 
increasing, which makes it a good approximation for the step 
function. The parameter i/ allows us to define how strict the 
distribution should be, with z/ —0 approximating the step 
function or a hard constraint. In this paper, following the 
suggestion of ESI, we use z/ = 10 

Since the monotonicity probability is not normal, it has 
to be approximated by a normal distribution to be used in 
the GP framework. To understand this, first consider the 
problem without the monotonicity constraints, which is given 
by Eq. (|4|i. The probability distribution of the observation is 
given by: 


p{L,\X,,X, Y) 


J piL4X,,X,L)piL\X,Y)dL, (6) 


where L is the latent variable for the training data, whose 












(e) Concave, ^ 0 


(f) Convex, —>• 0 


Figure 3: Contours for the f{y) learned using a Gaussian process with derivative constraint. The black dots are the frontier 
points provided. 
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yi 

(a) Concave, = 0.1 



yi 

(b) Convex, P = 0.1 


Figure 4: Contours for the f{y) learned using standard Gaussian process. The black dots are the frontier points provided. 


probability distribution, computed by the Bayes’ rule, is 


p{L\X,Y) 


p{Y\L)p{L\X) 

piY\X) 


piY\X) = J piY\L)p{L\X)dL. 


According to the model, the prior p{L\X) and the likelihoods 
p{Y\L) and X,L) are normal distributions, which 

makes all integrals tractable and all other distributions defined 
in the closed form presented in Eq. |4] 

Now, considering the monotonicity constraints, let Ai be 
the monotonicity constraints and L' be the random variable 
associated with the derivative of the latent variables L. Then 
the probability distribution in Eq. (|5]l can be written as 
p{Ai\L'). Rewriting the posterior distribution over the latent 
variables, we get: 


.nc... p{M\L')p{Y\L)piL,L'\X) 

p{L\X, Y,M) = - p^r^MlX) - 

p(Y,MlX)= j p{M\L')p{Y\L)p{L,L'\X)dLdL'. (7b) 


Because the distribution p{Ad\L') is not normal and every 
other distribution in Eq. (|7]i is normal, the integrals defined in 
Eqs. (|6]l and (iTbl i are intractable. Therefore, the distribution 
p{Ai\L') must be approximated by a normal distribution, 
which can be achieved using the expectation propagation al¬ 
gorithm ED, with the update equations described in ll20l . The 
expectation propagation algorithm iteratively adjusts an unnor¬ 
malized normal distribution to locally approximate the distri¬ 
bution defined by the soft constraints, such that p[M.\L') « 
ZJ\f{L'\'jl,Tj), where Z is a normalization constant, p is a 
mean vector with one value for each monotonicity constraint, 
and S is a diagonal covariance matrix. 

Besides this monotonicity constraint, we also would like 
that the errors between the provided values for the points z 
and their latent values I are small, so that the estimated shape 
of the Pareto frontier is closer to the true one. This can be 


achieved by placing a prior inverse-gamma distribution over 
cr^, whose density is given by: 

/ -a-l f 

P{X] a, P) = exp - , 

r(a) V xJ 

where r(-) is the gamma function. As /3 ^ oo, this prior is 
ignored, while P ^ 0 indicates that there is no noise. In the 
results shown, we fix a = 3 and vary p. 

We define /(y) as the final expected value E[l*\y*, Z, T, 9], 
and the parameters 6 are optimized to maximize the full 
likelihood, including gradient probability and cr^ prior, of the 
training data Y and Z. We also add the monotonicity constraint 
on all training data for all directions, but it should be noted 
that we can also add only monotonicity constraint at a point 
without defining its desired value. This allows us to find points 
that have f{y) =0 but negative gradient and add the constraint 
on them, which in turn could improve the estimation. 

To test the GP’s performance as a surrogate, we con¬ 
sider the two test frontiers whose samples are given by 
Pi = [(0,1), (e, e), (1,0)], which is a convex frontier, and 
P 2 = [(0,1), (1 — e, 1 — e), (1, 0)], which is a concave frontier, 
both with e = 10“^. Note that the points were purposely 
selected to test the ability to model very sharp frontiers. 
However, using only the points defined by Pi and P 2 leads to 
a solution where f{y) is almost 0 everywhere. To avoid this 
problem, we add a point (1,1), with target value 1, to Pi and 
a point (0,0), with target value —1, to P 2 . The parameters for 
the Gaussian process are found using gradient ascent in the 
samples likelihood. 

Eigure [3 shows the resulting curves for different values of 
P. The first thing we notice is that, although P ^ 00 does not 
place any restriction on a, which allows the observed points 
in the frontier to be far from their latent values that actually 
define the frontier, the resulting curve is still able to fit the 
general shape defined by the points provided. 

As we reduce the value of P, the observed variance is 
required to be smaller and the frontier shape gets better and 
better. Ideally, with /3 = 0, the latent points would be the same 
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as the observed points, but this causes numeric problems due 
to the monotonicity information and can make it harder to 
satisfy the monotonicity constraint, due to the smoothness of 
the GP. 

When we reduce the value to /3 = 0.01 and beyond, 
the resulting frontier is not valid anymore, with noticeable 
points with negative derivative. However, the largest difference 
in the concave problem is between points (0.82,1.055) and 
(0.2,0.985), with a total reduction in 1/2 of just 0.07, and a 
similar result is obtained for the convex case. Therefore, this 
approximation is still close to the correct frontier and could be 
used to evaluate proposed solutions because it was built with 
the theoretical developments of this paper in mind and tries to 
approximate them, which most likely provides better frontier 
estimates than methods that use traditional regression solu¬ 
tions, such as Qo), im, m, where the manifold f{y) = 0 
can have any shape. 

To evaluate the effect of using the gradient constraint. 
Fig. |4] shows a similar GP but without any information on the 
gradients. Although the expected Pareto frontier is correctly 
identified, there are also many points that do not belong to 
the frontier and where f{y) = 0. Since the unconstrained GP 
had better frontier estimates for the extreme points than the 
constrained GP, as all points between them and the knee satisfy 
the conditions, it appears that not every point benefits from the 
gradient constraint. 

Even though both GP models failed to fully satisfy the 
theoretical conditions, we consider that the GP with derivative 
restriction performed better, both because there are some 
parameter sets that are able to satisfy the frontier conditions 
and because it does not violate the restrictions as much. 
Moreover, if the variance, which is not shown but is higher 
for points far from the inputs provided, is taken into account, 
then the violations of the GP with derivatives occur in a region 
with higher uncertainty than the violations of the pure GP. 

Therefore, despite the minor violations of the GP with 
derivative constraints, this approximation is still close to the 
correct frontier and could be used to evaluate the proposed 
solutions. 


C. Comparison to Existing SVM Surrogate 

The surrogate method introduced in ifTOl . like the method 
proposed in this paper, is based on approximating the frontier 
directly from values in the objective space. This makes it a 
good candidate for comparison and validating the conjecture 
that existing methods may arbitrarily violate the conditions 
described in this paper. 

The one-class SVM used in ITOl is defined by the following 
optimization problem: 


min 



1 


N 


-P 


S.t. XtF(j){Xi) > p - C* 




where v € ( 0 , 1 ] and the feature-extraction function (j>{x) is 
defined implicitly by the kernel 

K{x, y) = exp(- 7 ||a: - t/f), 


which is similar to the kernel used for the GP. 

One important difference between training an SVM and 
a GP is that the GP has a natural way to optimize its 
hyper-parameters by maximizing the data likelihood, which 
automatically defines a trade-off between fitting the data and 
model complexity. For the SVM, we must use cross-validation 
122 , which reduces the number of points available to fit the 
model, since the data must be divided in the training and 
validation sets. 

To compare the surrogate methods, we use one test problem 
from 0 , which is also used in cni to show the behavior of 
the proposed SVM surrogate. The problem is given by: 

min /i(a:i,a: 2 ) = xi 

min f 2 (,Xi,X 2 ) = 1 + X 2 — xl — 0.2sin( 37 ra:i) 
s.t. Xl G [0,1], X 2 G [-2, 2]. 

We chose this problem because its true Pareto frontier is 
discontinuous, which creates sharp changes in its associated 
estimated Pareto frontier, just like in Fig. [1] and makes it 
harder to approximate. 

We chose u = 10“^ so that the samples provided should 
be almost perfectly classified and we constraint the scales pi 
in Eq. |2 to be equal, so that both methods can use the same 
features from the samples. The data set provided is composed 
of a grid with step 0.05 for both variables, which includes 
some points in the Pareto frontier. The full grid is used to fit 
the SVM because it provided better results than using just the 
non-dominated points, while only the non-dominated points 
are required for the GP. 

Figure |5] shows the resulting approximations of the Pareto 
frontier using a GP with parameters learnt through gradient 
ascent in the data likelihood, like in Section HV-Bl and an SVM 
with different values of 7 . The GP learns an appropriate shape 
from the samples provided despite the discontinuity in the 
frontier, but also slightly violates the constraints during the gap 
in /i G [0.3,0.5]. Moreover, in the absence of any information 
about the shape in the interval fi G (0.9, 1], because no point 
was provided there, the GP extrapolates a valid shape for the 
Pareto frontier. 

The SVM is highly dependent on the parameter 7 . When 
it is small, the shape learnt is very conservative and does 
not follow the shape defined by the points in the frontier. 
On the other hand, when it is large, the surrogate fits the 
points in the frontier better but also may define a function that 
violates greatly the conditions to be a valid Pareto frontier. 
The best value for 7 that does not violate the constraints 
in the interval /i G [0,0.9] is 7 = 5. However, for this 
value the GP provides a better approximation of the Pareto 
frontier, as shown in Fig. |50 Increasing 7 provides a better 
approximation, achieving a quality comparable to the GP, but 
also creates regions that violate the conditions to be a valid 
Pareto frontier more than the GP. Furthermore, 7 = 6 defines 
a region that the SVM believes is part of the Pareto frontier 
but actually is very distant from it and inside the dominated 
region, as shown in Fig. [52 

Besides these issues, the SVM also does not extrapolate 
well to the region /i G (0.9,1]. Close inspection shows that 
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Figure 5: Estimated frontiers using SVM with different values of 7 and Gaussian process. The points in the data set that belong 
to the true Pareto frontier are shown as dots. 


the dominated region defined by the SVM is finite, that is, 
it is described by a region in the objective space that is 
surrounded by an infinite region that the SVM believes is 
not dominated. This behavior shows that the learnt model 
carries no concept of the problem it is solving, which is to 
approximate a Pareto frontier, but describes a generic function 
approximation. The results in Fig. |5] provide evidence for the 
conjecture that existing methods proposed in the literature may 
arbitrarily violate the conditions described in this paper. 

Furthermore, if only the points at the Pareto frontier were 
provided for learning, then the region defined by the SVM 
would enclose only these points and would ignore the dom¬ 
inated region. Thus the SVM method requires data in the 
dominated region while the GP method only requires the 
points at the frontier. 

V. Conclusion 

In this paper, we have introduced the necessary and suf¬ 
ficient conditions that functions must satisfy so that their 
solution space describes an estimated Pareto frontier. These 
conditions follow from the definition of an estimated Pareto 
frontier and are extended for differentiable functions, which 
allows easier verification of the conditions. 


Based on these conditions, a Gaussian process (GP) was 
tested on toy problems with very sharp Pareto frontiers. The 
GP was extended to include the theoretical conditions as soft 
probabilistic constraints and a regularization term was added 
to avoid large deviations between the points and their latent 
values. The mean latent value is used as surrogate for the 
Pareto frontier, and some values of the regularization constant 
allowed a correct frontier estimate to be found. 

However, when the regularization becomes too strong, the 
surrogate violates the constraints that define a valid estimated 
Pareto frontier on some points, but this occurs far from the 
given inputs and the deviation is small. This suggests that, even 
under these conditions, the proposed function could be used 
to provide insight on the shape of the true Pareto frontier, and 
possibly provide more realistic estimates than other methods 
that do not take the restrictions into consideration during their 
design. 

To validate this hypothesis and the conjecture that existing 
surrogate methods may violate the conditions described in this 
paper, we compared the proposed GP with a one-class SVM 
used in ifTOl on one of the test problems described in the same 
paper. We showed that the GP again violates the constraints 
by small values and provide a good estimate for the Pareto 
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frontier, while the SVM defined a worse estimate or violated 
the conditions more than the GP. Furthermore, the dominated 
region defined by the SVM is bounded by what it represents 
as the non-dominated region, while the GP correctly divides 
the space in two infinite areas. 

Besides being a better surrogate for the Pareto frontier, the 
GP has the data likelihood as an innate measure that can 
be used to optimize its hyper-parameters and only requires 
data at the frontier. On the other hand, the SVM must use 
some method, like cross-validation ll22l . to optimize its hyper¬ 
parameters and it requires data in the dominated region to 
define better approximations. 

We highlight that, although GP were used together with the 
theory on this paper to approximate the Pareto frontier, the 
theory is general and does not depend on the specific choice 
of the function descriptor. Therefore, other models that are 
able to deal with the constraints imposed by the theory, in 
either a soft or hard way, should be able to learn the desired 
shape of the Pareto frontier too. Nonetheless, we are not aware 
of any other method to create the score function in which the 
constraints are as easy to include as in the GP. Additionally, 
a GP provides robustness to changing the number of points 
used in the estimation. 

Further investigations involve studying the behavior of the 
GP to approximate the Pareto frontier with real benchmarks 
and using some multi-objective optimization algorithm, such 
as NSGA-II 1221, to provide the points. Since the objectives 
tend to be smoother than in the example frontier provided 1241 . 
we expect the estimated Pareto frontier described by a GP to 
fit the true Pareto frontier even better in these problems. If this 
is the case, we will investigate the possibility of integrating 
the frontier surrogate with other surrogate models for the 
objectives, so that all of them are learned directly and the 
number of function evaluations could be reduced. 

Moreover, since the only requirement for the surrogate is 
that the Pareto frontier is approximated by the null space 
and the exact value on other parts of the objective space are 
not relevant, the GP could be used to fit a regression model 
on the individuals of a population where the target value 
is monotonically increasing in the objective space. Standard 
performance measures in multi-objective optimization, such 
as the class in non-dominated sorting 1231 and the dominance 
count l25l . satisfy this property and can be used as targets of 
the regression. In this case, the GP would not only define the 
Pareto frontier, but would also define a measure of the distance 
between a given point and the approximated Pareto frontier. 

Another interesting line of research is to evaluate when the 
derivative constraints on the points provided is beneficial, since 
in some points it avoids incorrect association of other points 
with the frontier, like around the knee in the unconstrained 
GP shown in this paper, and in others it may make the 
estimated shape not satisfy the constraints, like the points in 
the constrained GP also shown in this paper. This could not 
only provide better fit, but may also increase the fitting speed, 
since less constraints need to be evaluated, which reduces the 
size of the GP and the number of expectation propagation 
steps required. Therefore an iterative algorithm that adds the 
constraints as needed should be pursued. 
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